Extending Huffman Coding for Multilingual Text Compression

Authors

Chi-Hung Chi

Kwok-Shing Cheng

Ling Wong

Abstract

Traditional text compression algorithms such as Huffman and LZ variants are usually based on 8-bit character sampling. However, under the Unicode representation for multilingual information, the character set of a language such as Chinese or Japanese consists of a very large number of distinct characters, and thus 16-bit or 32-bit character sampling is needed. Consequently, when text compression algorithms based on 8-bit character sampling are applied to documents using 16-bit or 32-bit character sampling, a very poor data compression ratio (about 1.5 on average) is obtained. In this paper, we propose two new algorithms that are based on the 16-bit or 32-bit sampling character set and on the unique features of languages with a large number of distinct characters, to significantly improve data compression ratios for multilingual text documents. We choose the Chinese language using 16-bit character sampling (such as Big-5 or GB code) as the representative language in our study. The first approach, called Static Chinese Huffman Coding, introduces the concept of a single Chinese character in the Huffman tree. Experimental results on our PH corpus showed that the improvement in compression ratio obtained by this approach ranges from 20% to 29%. The second approach, called the Dictionary-Based Chinese Huffman Coding
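The idea of sampling at the character level rather than the byte level can be illustrated with a small sketch. This is not the authors' algorithm: it is a plain static Huffman coder, and the function names (`huffman_code_lengths`, `encoded_bits`, `compare`) and the sample string are assumptions made for illustration. It only counts the bits of the encoded payload for the same text when the symbols are bytes and when the symbols are whole characters.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a static Huffman code."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        # A lone symbol still needs one bit per occurrence.
        return {next(iter(freq)): 1}
    tie = count()  # tie-breaker so the heap never compares the dicts
    heap = [(f, next(tie), {s: 0}) for s, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every leaf one level deeper.
        merged = {s: depth + 1 for s, depth in left.items()}
        merged.update({s: depth + 1 for s, depth in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


def encoded_bits(symbols):
    """Total payload size in bits, ignoring the cost of the code table."""
    symbols = list(symbols)
    lengths = huffman_code_lengths(symbols)
    return sum(lengths[s] for s in symbols)


def compare(text, encoding="big5"):
    """Compare byte-level and character-level Huffman payload sizes."""
    raw = text.encode(encoding)
    return {
        "raw_bits": 8 * len(raw),
        "byte_symbols": encoded_bits(raw),   # iterating bytes yields ints
        "char_symbols": encoded_bits(text),  # one symbol per character
    }


if __name__ == "__main__":
    print(compare("中文中文中"))
```

Note that this sketch ignores the size of the code table, which grows with the number of distinct symbols; a fair comparison on real documents would have to account for it.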


Similar resources

Performance Improvement Of Bengali Text Compression Using Transliteration And Huffman Principle

In this paper, we propose a new compression technique based on transliteration of Bengali text to English. Compared to Bengali, English is a less symbolic language. Thus transliteration of Bengali text to English reduces the number of characters to be coded. Huffman coding is well known for producing optimal compression. When the Huffman principle is applied to transliterated text, significant perfo...


Data Compression Considering Text Files

Lossless text data compression is an important field as it significantly reduces storage requirement and communication cost. In this work, the focus is directed mainly to different file compression coding techniques and comparisons between them. Some memory efficient encoding schemes are analyzed and implemented in this work. They are: Shannon Fano Coding, Huffman Coding, Repeated Huffman Codin...


Text Compression Algorithms - a Comparative Study

Data Compression may be defined as the science and art of the representation of information in a crisply condensed form. For decades, Data compression has been one of the critical enabling technologies for the ongoing digital multimedia revolution. There are a lot of data compression algorithms which are available to compress files of different formats. This paper provides a survey of different...


Efficient Data Compression Scheme using Dynamic Huffman Code Applied on Arabic Language

The development of an efficient compression scheme to process the Arabic language represents a difficult task. This paper employs the dynamic Huffman coding on data compression with variable length bit coding, on the Arabic language. Experimental tests have been performed on both Arabic and English text. A comparison is made to measure the efficiency of compressing data results on both Arabic a...


Comparative study of Arithmetic and Huffman Compression Techniques for Enhancing Security and Effective Bandwidth Utilization in the Context of ECC for Text

In this paper, we proposed a model for text encryption using elliptic curve cryptography (ECC) for secure transmission of text and by incorporating the Arithmetic/Huffman data compression technique for effective utilization of channel bandwidth and enhancing the security. In this model, every character of text message is transformed into the elliptic curve points


Publication date: 2001
